Contributor

@wackywendell wackywendell commented Sep 15, 2025

Summary

Begins addressing #367 and #342 by adding support for parsing the types in YAML Simple Extension Files into idiomatic Rust types, with validity enforced. This includes a text parser for type strings that handles built-in types, compound types, named structs, custom types, and validated parameter constraints in the Simple Extension YAML files.

Scope

  • Types-only: no functions or call validation yet.
  • Public API exposes parsed types (ExtensionFile, Registry, CustomType, ConcreteType) and enforces validation of those on creation / read.

Key Changes

  • Type system
    • New BuiltinType, CompoundType, ConcreteType, CustomType with Display/round‑trip support for alias and named‑struct structures.
    • Parameter constraints: data type, integer (with min/max), enum (validated/deduped), boolean, string.
    • Parsing to and from the YAML structures (TryFrom<TypeParamDefsItem>, Parse<RawType>)
  • File/Registry type: abstraction for handling YAML files
  • Context and proto glue
    • Separates ProtoContext from Context, to distinguish what is needed only for Protobuf parsing (ProtoContext) from the general parsing context (Context).
  • Type expression parser
    • Parses simple types, user‑defined types (u!Name), and type variables; visits extension references for linkage bookkeeping.
  • Build/CI
    • parse feature includes serde_yaml; include!(extensions.in) is gated behind extensions feature.
    • Aligns actions/checkout to v4, updates Cargo dependency set, and bumps the substrait submodule.
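
As a rough illustration of the type expression parser described above, the sketch below classifies a type string as a simple built-in, a user-defined type (u!Name), or a type variable. It is not the PR's implementation: `TypeExpr`, `parse_type_expr`, `is_identifier`, and the short `SIMPLE_BUILTINS` list are hypothetical names, and the rule "any other identifier is a type variable" is an assumption made for the example.

```rust
/// Hypothetical, simplified AST for a type expression (not the PR's types).
#[derive(Debug, Clone, PartialEq)]
pub enum TypeExpr {
    /// A built-in type with no parameters, e.g. `i32`.
    Simple { name: String, nullable: bool },
    /// A user-defined type written as `u!Name`.
    UserDefined { name: String, nullable: bool },
    /// A type variable, e.g. `T` (assumed: any other identifier).
    Variable { name: String, nullable: bool },
}

/// Illustrative subset of built-in names; not the full Substrait list.
const SIMPLE_BUILTINS: &[&str] = &[
    "boolean", "i8", "i16", "i32", "i64", "fp32", "fp64", "string", "binary", "date",
];

/// Classify a single type expression string.
pub fn parse_type_expr(input: &str) -> Result<TypeExpr, String> {
    let trimmed = input.trim();
    // A trailing `?` marks the type as nullable.
    let (body, nullable) = match trimmed.strip_suffix('?') {
        Some(rest) => (rest, true),
        None => (trimmed, false),
    };

    // User-defined types carry the `u!` prefix.
    if let Some(name) = body.strip_prefix("u!") {
        if !is_identifier(name) {
            return Err(format!("invalid user-defined type name: {name:?}"));
        }
        return Ok(TypeExpr::UserDefined { name: name.to_string(), nullable });
    }

    if !is_identifier(body) {
        return Err(format!("invalid type expression: {body:?}"));
    }

    // Built-in names are matched case-insensitively in this sketch.
    let lower = body.to_ascii_lowercase();
    if SIMPLE_BUILTINS.contains(&lower.as_str()) {
        Ok(TypeExpr::Simple { name: lower, nullable })
    } else {
        Ok(TypeExpr::Variable { name: body.to_string(), nullable })
    }
}

/// ASCII identifier check: a letter or `_`, then letters, digits, or `_`.
fn is_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {
            chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
        }
        _ => false,
    }
}
```

Under these assumptions, `parse_type_expr("u!point")` yields a `UserDefined` value and `parse_type_expr("u!")` is rejected. Compound and parameterized types (list, map, decimal and so on) are deliberately left out to keep the sketch short.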

Compatibility Notes

  • New trait bound ProtoContext on proto parsing that previously required only Context.
  • extensions.in now compiled only with features=["extensions"].
  • Minimal, types-only round‑trip implemented; other sections remain empty when converting back to text.

Testing

  • New unit tests cover:
    • Type parsing and round‑trip for alias and named‑struct.
    • Parameter constraint handling including enum validation and integer bounds (with current truncation behavior).
    • Registry creation and type lookup; core registry smoke test behind features=["extensions"].
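
The constraint handling exercised by these tests might look roughly like the following sketch. It is only an approximation under stated assumptions: `ParamConstraint`, `ConstraintError`, and the constructor names are hypothetical, duplicate enum options are dropped (keeping first-seen order) rather than rejected, and empty enums and inverted integer bounds are treated as errors. The truncation behavior mentioned above is not modeled.

```rust
use std::collections::HashSet;

/// Hypothetical constraint type covering two of the kinds listed in the PR.
#[derive(Debug, Clone, PartialEq)]
pub enum ParamConstraint {
    /// Integer parameter with optional inclusive bounds.
    Integer { min: Option<i64>, max: Option<i64> },
    /// Enum parameter with a non-empty, duplicate-free option list.
    Enumeration { options: Vec<String> },
}

/// Hypothetical validation errors for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum ConstraintError {
    EmptyEnum,
    InvertedBounds { min: i64, max: i64 },
}

impl ParamConstraint {
    /// Build an integer constraint, rejecting `min > max`.
    pub fn integer(min: Option<i64>, max: Option<i64>) -> Result<Self, ConstraintError> {
        if let (Some(lo), Some(hi)) = (min, max) {
            if lo > hi {
                return Err(ConstraintError::InvertedBounds { min: lo, max: hi });
            }
        }
        Ok(Self::Integer { min, max })
    }

    /// Build an enum constraint, dropping duplicates and rejecting empty lists.
    pub fn enumeration<I, S>(options: I) -> Result<Self, ConstraintError>
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        let mut seen = HashSet::new();
        let mut unique = Vec::new();
        for option in options {
            let option: String = option.into();
            // `insert` returns false when the value was already present.
            if seen.insert(option.clone()) {
                unique.push(option);
            }
        }
        if unique.is_empty() {
            return Err(ConstraintError::EmptyEnum);
        }
        Ok(Self::Enumeration { options: unique })
    }

    /// Check an integer value against the bounds; enums never accept integers.
    pub fn accepts_integer(&self, value: i64) -> bool {
        match self {
            Self::Integer { min, max } => {
                min.map_or(true, |lo| value >= lo) && max.map_or(true, |hi| value <= hi)
            }
            Self::Enumeration { .. } => false,
        }
    }
}
```

Validating in the constructors means an invalid constraint cannot be built at all, which is one way to read "enforces validation on creation" in the Scope section.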

@wackywendell wackywendell changed the title Add Initial Extension Support feat: Add Initial Extension Support Oct 10, 2025
@wackywendell wackywendell changed the title feat: Add Initial Extension Support feat: add initial extension support Oct 10, 2025
@wackywendell wackywendell marked this pull request as ready for review October 20, 2025 15:32
Member

@benbellick benbellick left a comment

Left a few comments but still haven't gone through most of the PR. Just wanted to flush what I have so far, but I will review more later. Thanks!

Member

@benbellick benbellick left a comment

Left a few more comments, but excited to get this in! Sorry that this PR has been sitting so long 😅

Removed the ProtoContext trait, replacing it with a concrete type.
Also includes some small refactoring simplifications and code cleanup.
Member

@benbellick benbellick left a comment

Left a few more comments. Had to run so I didn't get to entirely finish the review but will revisit later. Thanks

/// Integer parameter (e.g., precision, scale)
Integer(i64),
/// Type parameter (nested type)
Type(ConcreteType),
Member

Just wanted to point out that the spec technically allows for other type parameters:

  message Parameter {
    oneof parameter {
      // Explicitly null/unspecified parameter, to select the default value (if
      // any).
      google.protobuf.Empty null = 1;

      // Data type parameters, like the i32 in LIST<i32>.
      Type data_type = 2;

      // Value parameters, like the 10 in VARCHAR<10>.
      bool boolean = 3;
      int64 integer = 4;
      string enum = 5;
      string string = 6;
    }
  }

That being said, I would prefer you leave what you have as is since the PR is fairly large already. Just a note for future work!
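
For that future work, a parameter enum covering every arm of the oneof above could be sketched as follows. This is a hypothetical shape, not code from the PR: `TypeParameter`, `proto_field`, and the stand-in `ConcreteType` struct are assumptions made for the example.

```rust
/// Hypothetical stand-in for the PR's `ConcreteType`, kept as a plain string here.
#[derive(Debug, Clone, PartialEq)]
pub struct ConcreteType(pub String);

/// One variant per arm of the proto `Parameter` oneof quoted above.
#[derive(Debug, Clone, PartialEq)]
pub enum TypeParameter {
    /// Explicitly null/unspecified parameter, selecting the default value.
    Null,
    /// Data type parameter, like the i32 in LIST<i32>.
    DataType(ConcreteType),
    /// Boolean value parameter.
    Boolean(bool),
    /// Integer value parameter, like the 10 in VARCHAR<10>.
    Integer(i64),
    /// Enum value parameter.
    Enum(String),
    /// String value parameter.
    Str(String),
}

impl TypeParameter {
    /// Name of the corresponding field in the proto oneof.
    pub fn proto_field(&self) -> &'static str {
        match self {
            TypeParameter::Null => "null",
            TypeParameter::DataType(_) => "data_type",
            TypeParameter::Boolean(_) => "boolean",
            TypeParameter::Integer(_) => "integer",
            TypeParameter::Enum(_) => "enum",
            TypeParameter::Str(_) => "string",
        }
    }
}
```

Because the match in `proto_field` is exhaustive, adding a variant later would produce a compile error at every place that needs updating.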

/// Parameterized builtin types that require non-type parameters, e.g. numbers
/// or enum
#[derive(Clone, Debug, PartialEq)]
pub enum BuiltinParameterized {
Member

Is there a reason to separate out builtin parameterized types which don't take in type parameters from those that do? Not saying this is necessarily wrong, just curious to understand why. Substrait-java treats both e.g. decimal (which takes only an int param) and list as implementors of BaseParameterizedType.

/// Sub-second precision digits
precision: i32,
},
}
Member

Thank you for these wonderful comments 🙏

/// Unified representation of simple builtin types (primitive or parameterized).
/// Does not include container types like List, Map, or Struct.
#[derive(Clone, Debug, PartialEq)]
pub enum BuiltinKind {
Member

Any reason not to consistently use Type here? I.e. BuiltinType.

/// Type parameters (e.g., for generic types)
pub parameters: Vec<TypeParam>,
/// Concrete structure definition, if any
pub structure: Option<ConcreteType>,
Member

Just a note (haven't checked yet). It would be good to make sure there is a test specifically for structure.

This is a rarer feature: we only recently introduced handling for it in substrait-go, and it is still a WIP for substrait-java.

Member

Ah there kind of implicitly is a test for this because of the extension_types.yaml file in the core extensions, which the registry.rs at least checks is able to load everything without an error.

Member

Just an idea, and it doesn't necessarily have to be this PR, but what do you think about just making the Context and Parse traits pub(crate)? As I understand it, the only public API necessary for parsing is the set of impls for ExtensionFile.

self.simple_extensions
.get(anchor)
.ok_or(ContextError::UndefinedSimpleExtension(*anchor))
}
Member

I don't see this function used anywhere yet. Wondering if it makes sense to just switch the return type to be Option<&Urn>? I could imagine contexts in which we might want to use this function but not consider the "not found" case an error.
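
To make the suggestion concrete, here is a minimal sketch of an Option-returning lookup with a thin strict wrapper for callers that do want an error. The names `AnchorTable`, `get_simple_extension`, and `require_simple_extension`, the `u32` anchor type, and the stand-in `Urn` struct are assumptions for illustration; only `ContextError::UndefinedSimpleExtension` and the `simple_extensions` field name come from the snippet above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the crate's `Urn` type.
#[derive(Debug, Clone, PartialEq)]
pub struct Urn(pub String);

/// Error variant named after the one in the snippet; the payload type is assumed.
#[derive(Debug, Clone, PartialEq)]
pub enum ContextError {
    UndefinedSimpleExtension(u32),
}

/// Hypothetical container mapping anchors to extension URNs.
#[derive(Debug, Default)]
pub struct AnchorTable {
    simple_extensions: HashMap<u32, Urn>,
}

impl AnchorTable {
    /// Register a URN under an anchor, replacing any previous entry.
    pub fn insert(&mut self, anchor: u32, urn: Urn) {
        self.simple_extensions.insert(anchor, urn);
    }

    /// Lookup that leaves the "not found" decision to the caller.
    pub fn get_simple_extension(&self, anchor: u32) -> Option<&Urn> {
        self.simple_extensions.get(&anchor)
    }

    /// Strict variant for call sites where a missing anchor is an error.
    pub fn require_simple_extension(&self, anchor: u32) -> Result<&Urn, ContextError> {
        self.get_simple_extension(anchor)
            .ok_or(ContextError::UndefinedSimpleExtension(anchor))
    }
}
```

With this split, a caller that treats a missing anchor as normal can match on the `Option`, while one that treats it as a failure calls the `require_` variant and propagates the error with `?`.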

}

-impl<C: Context> Parse<C> for proto::Version {
+impl Parse<ExtensionAnchors> for proto::Version {
Member

nit: Does parsing Version require ExtensionAnchors context, or any context at all? Could imagine either passing around () for empty context, or having two traits, one called something like Parse and another called ParseWith.

I'm just dumping some thoughts though :) this doesn't have to be in this PR, especially considering that Parse already existed.
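
One possible shape for the two-trait idea, sketched under assumptions: the trait signatures below are not the crate's existing `Parse` trait, and `RawVersion`, `Version`, `VersionError`, and the "all-zero version counts as missing" rule are hypothetical and only serve the example. A blanket impl lets anything that parses with the unit context `()` also parse with no context at all.

```rust
/// Parsing that needs a context of type `C`.
pub trait ParseWith<C>: Sized {
    type Parsed;
    type Error;
    fn parse_with(self, ctx: &mut C) -> Result<Self::Parsed, Self::Error>;
}

/// Parsing that needs no context.
pub trait Parse: Sized {
    type Parsed;
    type Error;
    fn parse(self) -> Result<Self::Parsed, Self::Error>;
}

/// Anything that parses with the unit context also parses without one.
impl<T: ParseWith<()>> Parse for T {
    type Parsed = <T as ParseWith<()>>::Parsed;
    type Error = <T as ParseWith<()>>::Error;

    fn parse(self) -> Result<Self::Parsed, Self::Error> {
        self.parse_with(&mut ())
    }
}

/// Hypothetical unvalidated version, standing in for the proto message.
#[derive(Debug, Clone, PartialEq)]
pub struct RawVersion {
    pub major: u32,
    pub minor: u32,
    pub patch: u32,
}

/// Hypothetical validated version.
#[derive(Debug, Clone, PartialEq)]
pub struct Version {
    pub major: u32,
    pub minor: u32,
    pub patch: u32,
}

#[derive(Debug, Clone, PartialEq)]
pub enum VersionError {
    /// Illustrative rule: an all-zero version is treated as missing.
    Missing,
}

impl ParseWith<()> for RawVersion {
    type Parsed = Version;
    type Error = VersionError;

    fn parse_with(self, _ctx: &mut ()) -> Result<Version, VersionError> {
        if self.major == 0 && self.minor == 0 && self.patch == 0 {
            return Err(VersionError::Missing);
        }
        Ok(Version { major: self.major, minor: self.minor, patch: self.patch })
    }
}
```

A call site then reads `RawVersion { major: 0, minor: 42, patch: 1 }.parse()`, with no context to construct or pass around.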

}
other => panic!("unexpected error type: {other:?}"),
}
}
Member

I haven't checked myself yet if it already exists but just noting that it would be good to have tests somewhere that ensure all of the core extension files parse correctly. Could be just as simple as showing they don't result in an error.

Member

Yup, you already got it in the registry file, nice!
